"Team Stuttgart and Tübingen – GeneiusVis"
VAST 2010 Challenge
Genetic Sequences - Tracing the
Mutations of a Disease
Authors and Affiliations:
Julian Heinrich, University of Stuttgart, julian.heinrich@visus.uni-stuttgart.de
Andre Burkovski, University of Stuttgart, andre.burkovski@visus.uni-stuttgart.de
Florian Battke, University of Tübingen, florian.battke@uni-tuebingen.de
Alexander Herbig, University of Tübingen, alexander.herbig@uni-tuebingen.de
Stephan Symons, University of Tübingen, symons@informatik.uni-tuebingen.de
Kay Nieselt, University of Tübingen, kay.nieselt@uni-tuebingen.de
Tool(s):
To solve this mini challenge we developed a visual analytics tool,
‘GeneiusVis’, which integrates a phylogenetic tree calculated with the
neighbor joining method. The phylogenetic trees were computed with ClustalX.
ClustalX is a multiple sequence alignment program; however, its alignment
procedures were not used since the sequences for the challenge were already
aligned. The phylogenetic tree was exported as a Newick file, a standard file
format used in phylogenetics. GeneiusVis consists of two linked views: a Tree
Visualizer offering different layout algorithms for phylogenetic trees as well
as interactive node and edge selection, and an alignment viewer for multiple
sequence alignments that allows the mutations of a disease to be traced.
Selections of rows in the alignment viewer are linked to the respective nodes
in the Tree Visualizer and vice versa. The alignment viewer provides
interactive computation of consensus sequences from selected rows. The consensus
sequence represents the most frequent nucleotide at each position of the
selected sequences.
Additionally, R
was used to compute the mutual information of pairs of columns in a multiple
sequence alignment, as well as to determine the mean evolutionary divergence
between two groups of sequences. Finally, the WEKA library was used to
validate the findings.
Video:
ANSWERS:
MC3.1: What is the region or country of origin for the
current outbreak?
Answer: Nigeria_B
To determine the origin of the virus associated with an
outbreak of the Drafa virus, we conducted genetic analyses of all native
sequences and those of the current disease outbreak. Phylogenetic analyses
based on the nucleotide sequences showed that all viral sequences from the
disease outbreak are very closely related and cluster monophyletically. This
indicates that all strains from the current outbreak share one common ancestor,
the strain from Nigeria B (highlighted red in Figure 1). The same answer is
found when using amino acid sequences. In addition, we used R to compute the
average nucleotide divergence between the current Drafa viruses and each of the
native strain sequences. The minimum is 0.010799, which again is the divergence
to Nigeria B.
Figure 1: A phylogenetic tree of all sequences in the Tree Visualizer. The
native strain sharing the lowest common ancestor with all strains of the
current outbreak is highlighted in red.
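The average divergence used to support the MC3.1 answer can be illustrated with a minimal Python sketch. It assumes a simple p-distance (the proportion of differing aligned positions); the report does not state which distance model was used in R, so this measure, the function names and the toy sequences are all illustrative assumptions and not the challenge data.

```python
from itertools import product


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned, gap-free positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore positions where either sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def mean_divergence(group_a, group_b) -> float:
    """Average p-distance over all pairs drawn from the two groups."""
    distances = [p_distance(a, b) for a, b in product(group_a, group_b)]
    return sum(distances) / len(distances)


def closest_native(outbreak, natives) -> str:
    """Name of the native strain with minimal mean divergence to the outbreak."""
    return min(natives, key=lambda name: mean_divergence(outbreak, [natives[name]]))


# Toy data (hypothetical, not taken from the challenge).
outbreak = ["ACGTACGT", "ACGTACGA"]
natives = {"native_X": "ACGTACGG", "native_Y": "TTGTACGG"}
print(closest_native(outbreak, natives))
```

The native strain with the smallest mean divergence to the outbreak group is the candidate origin, which mirrors the minimum reported above.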
MC3.2: Over time, the virus spreads and the diversity
of the virus increases as it mutates. Two patients infected with the Drafa virus
are in the same hospital as Nicolai. Nicolai has a strain identified by
sequence 583. One patient has a strain identified by sequence 123 and the other
has a strain identified by sequence 51. Assume only a single viral strain is in
each patient. Which patient likely contracted the illness from Nicolai and why?
Answer: ID 123
After performing a phylogenetic analysis of the
nucleotide sequences restricted to the sequences of the current disease
outbreak, we imported the resulting tree into the Tree Visualizer. The Alignment
Viewer can be used to sort and select IDs more efficiently. The corresponding
nodes are simultaneously selected and highlighted in the tree. We selected the
three sequences with IDs 583, 123 and 51 and interactively colored them red.
From the tree, it is evident that 123 is much closer to 583 than 51 is. This
is supported by the evolutionary divergence between 583 and 123, which is
0.000713, while the evolutionary divergence between 583 and 51 is 0.002141.
The same answer is found when using amino acid sequences.
Figure 2: Phylogenetic tree and alignment of all current outbreak sequences.
Selection of strains in the alignment viewer automatically highlights the respective
nodes in the tree.
MC3.3: Signs and symptoms of the
Drafa virus are varied and humans react differently to infection. Some mutant
strains from the current outbreak have been reported as being worse than others
for the patients that come in contact with them.
Identify the top 3 mutations that
lead to an increase in symptom severity (a disease characteristic). The
mutations involve one or more base substitutions. For this question, the
biological properties of the underlying amino acid sequence patterns are not
significant in determining disease characteristics.
For each mutation provide the base
substitutions and their position in the sequence (left to right) where the base
substitutions occurred. For example,
C -> G, 456 (C changed to G at
position 456)
G -> A, 513 and T -> A, 907 (G
changed to A at position 513 and T changed to A at position 907)
A -> G, 39 (A changed to G at
position 39)
Answer:
A -> G, 223
A -> C, 269
T -> C, 109
For all following tasks, nucleotides and disease
characteristics (with increasing severity from 0 to 2) have been mapped to
colors and opacity.
Figure 3: Alignment Viewer with disease characteristics.
We sorted all sequences by symptom severity and
computed the consensus sequence for every severity group. The color of the
consensus nucleotide corresponds to the most frequent one of the group while
opacity reflects its relative frequency and thereby the degree of conservation
in the respective group. The most prominent correlations between alignment
columns and the opacity of symptom severity occur in columns 22, 79, 109, 161,
223, 269, 842 and 946. As positions 22, 79, 161, 842 and 946 turn out to be
correlated with other disease characteristics (see below), only 109, 223 and
269 remain.
Figure 4: Three consensus sequences grouped by strains with equal symptom
severity.
MC3.4: Due to the rapid spread
of the virus and limited resources, medical personnel would like to focus on
treatments and quarantine procedures for the worst of the mutant strains from
the current outbreak, not just symptoms as in the previous question. To find
the most dangerous viral mutants, experts are monitoring multiple disease
characteristics.
Consider each virulence and drug
resistance characteristic as equally important. Identify the top 3 mutations
that lead to the most dangerous viral strains. The mutations involve one or
more base substitutions. In a worst case scenario, a very dangerous
strain could cause severe symptoms, have high mortality, cause major
complications, exhibit resistance to anti viral drugs, and target high risk
groups. For this question, the biological properties of the underlying amino
acid sequence patterns are not significant in determining disease
characteristics.
For each mutation provide the base
substitutions and their position in the sequence (left to right) where the base
substitutions occurred. For example,
C -> G, 456 (C changed to G at
position 456)
G -> A, 513 and T -> A, 907 (G
changed to A at position 513 and T changed to A at position 907)
A -> G, 39 (A changed to G at
position 39).
Answer:
T -> C, 842 and A -> T, 946
G -> C, 161 and T -> C, 790
G -> C, 22 and C -> A, 79
Our general approach is to associate the genotype of a
strain with its disease characteristics, the phenotype, in a matrix-based
alignment view. Symptom characteristics were mapped to integers and added as
meta information to the nucleotide alignment. An additional column summing the
symptoms for each patient was added as a scoring function for overall virulence.
Columns can be moved, hidden and sorted. In the alignment viewer, each position
is colored either by nucleotide or by attribute value. For attributes, a
single hue (red) was used, with opacity denoting the attribute value. If rows
are selected, a consensus sequence can be computed, which is then shown instead
of the rows it represents. The nucleotide of the consensus sequence at position
i is the one with the largest frequency, $\arg\max_{c \in \{A,G,C,T\}} f(c,i)$,
where f(c,i) is the frequency of nucleotide c at position i among the selected
rows. Here, opacity is mapped to the relative frequency of the consensus
nucleotide, reflecting its degree of conservation. Several consensus sequences
can also be joined into a new consensus sequence, allowing the user to
interactively build a ‘consensus tree’. Finally, for attribute values the
average is taken as consensus instead of the most frequent value. This is
reasonable because the disease characteristics were previously mapped to a
linear scale.
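The consensus rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the GeneiusVis implementation (which was written with Qt); the function names and the alphabetical tie-breaking are assumptions made here.

```python
from collections import Counter


def consensus(rows):
    """Return one (nucleotide, relative_frequency) pair per alignment column.

    The nucleotide is the most frequent character in the column; the relative
    frequency is what the viewer would map to opacity.
    """
    result = []
    for column in zip(*rows):
        counts = Counter(column)
        # Assumption: ties are broken alphabetically (max keeps the first maximum).
        nucleotide, count = max(sorted(counts.items()), key=lambda item: item[1])
        result.append((nucleotide, count / len(column)))
    return result


def attribute_consensus(values):
    """Attributes are on a linear scale, so their consensus is the mean."""
    return sum(values) / len(values)


# Toy alignment of three selected rows (hypothetical data).
selected = ["ACGT", "ACGA", "TCGA"]
for position, (nucleotide, frequency) in enumerate(consensus(selected), start=1):
    print(position, nucleotide, round(frequency, 2))
```

A fully conserved column yields a relative frequency of 1.0 (full opacity), while a column split between nucleotides yields a lower value.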
We first noted that many positions in the alignment
are perfectly conserved, i.e. all strains have an identical nucleotide at that
position. Since conserved positions cannot contribute to virulence of a mutant,
we (automatically) removed these positions from further considerations. We also
removed all columns for which just one strain differed from the other strains.
These singular mutations define the individual strain identity, but not mutant
virulence.
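The column filtering described above can be sketched as follows. This is a minimal Python illustration under the assumption that a "singular mutation" means a column with exactly two distinct characters, one of which occurs in a single strain; the function name is hypothetical.

```python
from collections import Counter


def informative_columns(rows):
    """Return 1-based positions that are neither conserved nor singular."""
    keep = []
    for position, column in enumerate(zip(*rows), start=1):
        counts = Counter(column)
        if len(counts) == 1:
            continue  # perfectly conserved: identical nucleotide in all strains
        # Assumed definition of a singular mutation: exactly one strain
        # differs from all the others, which agree with each other.
        if len(column) > 2 and len(counts) == 2 and min(counts.values()) == 1:
            continue
        keep.append(position)
    return keep


# Toy alignment (hypothetical data): columns 1 is conserved, 2 and 5 are singular.
alignment = ["ACGTA", "ACGTC", "ACTAC", "AGTAC"]
print(informative_columns(alignment))
```

Positions are reported left to right starting at 1, matching the position convention used in the challenge questions.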
We then compared the remaining 14 columns with the
virulence score (the sum over all attribute values). The entries of the score
range from 1 to 8. First, we sorted all rows according to the virulence score.
Next, we asked whether all 8 levels of phenotypes are represented by
different consensus sequences. The main idea is to find visual correlations of
the opacity of an alignment column with the opacity of the phenotype
attributes. Again we searched for the visually most prominent correlation of
alignment and attribute columns. However, for the remaining 14 columns we see
that such a fine resolution into 8 phenotypes is not reflected by the consensus
sequences (Figure 5).
Figure 5: 8 consensus sequences of all sequences with virulence score levels
1, 2, ..., 8, respectively.
Figure 6: 4 consensus sequences corresponding to 4 levels of overall virulence.
For clarity, only the colors and corresponding opacities are shown. The tooltip
of a cell shows the underlying distribution of nucleotides.
Since our tool allows consensus sequences to be computed quickly from selected
rows and compared visually, we gradually reduced the resolution of the overall
phenotype groups from 8 to 4. The 4 phenotypes correspond to very low (scores
1 and 2), low (scores 3 and 4), medium (scores 5 and 6) and high (scores 7 and
8) virulence (Figure 6). We see four overall patterns of alignment columns:
1: decreasing opacity, indicating increasing mutation
rate with increased virulence
2: increasing opacity, indicating decreasing mutation
rate with increased virulence
3: one color but not with steadily increasing or
decreasing opacity, and
4: columns with more than one color, indicating that
the majority of strains in that group have a different nucleotide at that
position than all other strains.
Positions 842, 161 and 790 belong to pattern 1,
positions 22, 79 and 946 are from pattern number 4.
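The reduction from 8 score levels to 4 phenotype groups described above can be sketched in Python. The score is the sum of the integer-coded attributes and the bins follow the text (1-2, 3-4, 5-6, 7-8); the function names and the toy attribute values are assumptions for illustration only.

```python
from collections import defaultdict

LEVELS = ["very low", "low", "medium", "high"]


def virulence_score(attributes):
    """Overall virulence: the sum of the integer-coded disease characteristics."""
    return sum(attributes)


def virulence_level(score):
    """Map a score in 1..8 to one of the four groups used in Figure 6."""
    if not 1 <= score <= 8:
        raise ValueError("score outside the 1..8 range reported in the text")
    return LEVELS[(score - 1) // 2]


def group_rows_by_level(rows, scores):
    """Collect alignment rows per level so one consensus can be built per group."""
    groups = defaultdict(list)
    for row, score in zip(rows, scores):
        groups[virulence_level(score)].append(row)
    return dict(groups)


# Toy rows with hypothetical attribute values.
rows = ["ACGT", "ACGA", "TCGA"]
scores = [virulence_score(a) for a in ([1, 0, 1], [2, 1, 1], [2, 2, 2, 1, 1])]
print(group_rows_by_level(rows, scores))
```

Each resulting group can then be collapsed into one consensus sequence and compared with the others, as in Figure 6.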
However, the choice is not clear-cut, and other positions could also be chosen
as candidates. One observation is that individual positions contribute
differently to the individual disease characteristics. We successively repeated
our analyses with each of the 6 attribute columns and found that positions 161
and 790 mainly lead to worse 'complications', while positions 22, 79 (and
possibly 1033) and 842, 946 correlate well with 'drug resistance' (Figure 7).
As ‘complications’ seem to become more severe with increasing ‘drug
resistance’, these positions appear to drive increasing overall virulence.
Figure 7: The reduced alignment with attributes but
without labels, sorted by ‘complications’ and ‘drug resistance’.
Using color without labels, as in Figure 7, helps the
researcher to identify patterns among columns. For example, after sorting rows
according to ‘complications’, we immediately see that this attribute is
correlated with position 790: all strains with major complications have a ‘C’
(blue color), while the other strains, except for two strains with minor
complications, have a ‘T’ (purple color) at that position. We also clearly see
the correlated mutation patterns of positions 22, 79 and possibly 1033, as well
as of positions 842 and 946.
Using the statistics package R, we also determined which
positions are correlated. To this end, we computed the pairwise mutual
information and visualized the results in a heatmap (Figure 8). The heatmap
allows quick identification of cells with large mutual information values,
which correspond to pairs of highly correlated columns.
Figure 8: Heatmap of pairwise mutual information of
selected columns in alignment.
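The pairwise mutual information behind Figure 8 was computed in R; the following Python sketch shows the same quantity under the standard definition, I(X;Y) = sum over (x,y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ), estimated from column frequencies. The function names and the toy alignment are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    total = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        total += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return total


def pairwise_mutual_information(rows, positions):
    """Mutual information for every pair of the given 1-based positions."""
    columns = {p: [row[p - 1] for row in rows] for p in positions}
    return {
        (p, q): mutual_information(columns[p], columns[q])
        for p, q in combinations(positions, 2)
    }


# Toy alignment (hypothetical data): columns 1 and 2 mutate together.
alignment = ["ACC", "ACG", "TGC", "TGG"]
for pair, value in pairwise_mutual_information(alignment, [1, 2, 3]).items():
    print(pair, round(value, 3))
```

Pairs of columns that always change together reach the maximum value, while independent columns give a value near zero, which is what the bright and dark cells of a heatmap would encode.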
We see several possible pairs of correlated mutations.
Positions 161, 790, 842 and 946 are highly correlated. Furthermore, positions
22 and 79 show a significant correlation with each other as well as with
position 161. Altogether, our top mutations for overall virulence are columns
842 and 946, columns 22 and 79, and columns 790 and 161.
Efforts: One team member applied the tools offered by
the WEKA library to identify the best features for classifying the
strains with respect to their symptoms (1 day). Another team member implemented
a mutual information analysis in R in order to identify pairwise correlated
columns in the alignment (1 day). One team member implemented the alignment
viewer using the Qt toolkit in about one week. The tree visualizer was already
implemented in Java and only needed to be extended for communication with the
alignment viewer (1 day).